Distinguishing examples while building concepts in hippocampal and artificial networks

The hippocampal subfield CA3 is thought to function as an auto-associative network that stores experiences as memories. Information from these experiences arrives directly from the entorhinal cortex as well as indirectly through the dentate gyrus, which performs sparsification and decorrelation. The computational purpose for these dual input pathways has not been firmly established. We model CA3 as a Hopfield-like network that stores both dense, correlated encodings and sparse, decorrelated encodings. As more memories are stored, the former merge along shared features while the latter remain distinct. We verify our model’s prediction in rat CA3 place cells, which exhibit more distinct tuning during theta phases with sparser activity. Finally, we find that neural networks trained in multitask learning benefit from a loss term that promotes both correlated and decorrelated representations. Thus, the complementary encodings we have found in CA3 can provide broad computational advantages for solving complex tasks.

Reviewer #2 (Remarks to the Author): …exploration of the complementary encoding strategies employed by the hippocampal circuit. The proposed computational model is both elegant and intuitive, and their experimental findings provide compelling evidence for the existence of these distinct encoding modes. It is interesting that the authors demonstrated that these computational properties found in hippocampal circuits could be applied to improve machine learning performance. This paper represents an important contribution to our understanding of the neural mechanisms underlying learning and memory.
Below are the major/minor points I would love the authors to address.
Major points:

1. The retrieval of PP patterns appears similar to the average image in one category, and the authors treated this as retrieving a concept of that category. Can the authors provide more evidence about why this mean image can represent the concept of a 'concept'? Is this mean image an analysis artifact due to the simple dataset, where images belonging to a single category do not have large variance, so the averaged image can still be recognized as sneakers, trousers, or coats? In other words, the authors could provide more evidence for why the mean image can represent the concept of a category by discussing previous studies that have used similar methods and how they have validated their approach. They could also address the concern about the simple dataset used and discuss whether the mean image approach would still hold for more complex datasets (such as Cifar10).
2. The threshold \theta in the sigmoid function is controlled by the width of the function, but how is the changing of the width being linked to the hippocampal theta oscillation? More importantly, the oscillating threshold does not necessarily represent the oscillation rhythm at the theta band. The authors might want to look at the Pfeiffer and Foster (2015) paper, where they found that slow gamma oscillations route the information flow during offline states, i.e., the decoded trajectory from awake SWRs.

Minor points:
1. Pattern separation through the MF pathway might take time. In this case, will the MF and PP encodings arrive at CA3 at the same time? If one arrives earlier than the other, will it still be a simple addition in the mathematical model?

2. The authors could explain why they chose to use the autoencoder to represent the neural representation of EC neurons and provide some evidence for why this approach is valid.
3. To my understanding, the synaptic connections in the Hopfield-like network (Eq. 2) will give attractor states as the superimposed patterns, e.g., 0.9x_i^{MF} + 0.1x_i^{PP}. How can the network retrieve either x_i^{MF} or x_i^{PP}?

4. It would be clearer to simply plot a histogram of the spike numbers against LFP theta phase to show the theta modulation of place cell activity (line 211 and Fig. 4B, making the histogram on the right side of panel B bigger).

5. Is the Wilcoxon signed-rank test in Fig. 4K two-tailed or one-tailed?
Overall, the authors provided compelling evidence for the existence of complementary encoding modes in CA3, and their findings shed new light on the neural mechanisms underlying learning and memory. If the authors could address the issues I raised, particularly the major points related to the concept representation and the link between the threshold and hippocampal theta oscillation, then I would be happy to recommend this paper for publication.
Reviewer #3 (Remarks to the Author): The authors present a paper arguing first that different pathways converging upon hippocampal area CA3 may preferentially code for more episode-unique (DG-CA3) and more conceptual (EC-CA3) information. Next, they show evidence for an alternation of biases across the hippocampal theta rhythm towards sparser (more episodic) and then coarser (more conceptual) information in CA3 place cells, which was highly impressive. Finally, they show that this type of parallel coding can improve learning in a more general neural network architecture. I found the paper impressive overall in its ambition and scope. It was well-written and well-argued, with very few mistakes or unclear passages, and it was very thorough in terms of having strong primary findings as well as nice ancillary findings. Here are some comments that I hope the authors will consider:

Intro/Discussion: Theta power changes during various tasks [e.g., increases at successful encoding (Jacobs et al., 2006; numerous Michael Kahana papers) or exploration in rodents (countless papers)]. Does theta power mean more sparsity overall or less? Or just a wider range of sparsity levels across the cycle? Addressing this point could link these findings to vast literatures that largely report theta power values, and it could generate new hypotheses.

Smaller points
Abstract - first sentence: Using 'sensory' makes it seem as if the hippocampus does not store 'conceptual' information along with a memory trace.
"Meanwhile, recent research has found that the hippocampus also parficipates in semanfic memory, as evidenced by grandmother cells that generalize over your visits and respond to many different representafions of your grandmother (Quiroga et al., 2005(Quiroga et al., , 2009))."Norman et al. (2021, Science) is also relevant to this point.
p. 4 -"Sparsity is the fracfion of acfive neurons, so lower values correspond to sparser pafterns."As a concept, usually higher values of that concept indicate more of that concept, so, generally speaking, higher "sparsity" values would mean more sparse pafterns rather than less sparse ones.Ulfimately, I appreciate that the authors define this, and I respect their choice if they decide to keep this term, but perhaps they could consider using the opposite term, "density".
End of p. 4 - It may help to orient the reader to point out that after CA1, the signals would need to get back out to cortex.

Fig 2G - Does the PP concepts line go higher than the maximum achievable overlap? If so, is this an error, or have I misunderstood?

Top p. 6 - "because since" typo

p. 6 - "During retrieval, the network is asynchronously updated via Glauber dynamics (Amit et al., 1985). That is, at each simulation timestep, one neuron is randomly selected to be updated (Fig. 2C). If its total input from other neurons exceeds a threshold θ, then it is more likely to become active." In this scheme, is there only further activation, or do active neurons also become randomly selected to possibly become inactivated (balancing the overall activation level)?

Bottom p. 6 to p. 7 - For myself as a reader, these paragraphs were incredibly helpful - I think the authors really hammer the main points home here well.

p. 10 - I very much appreciate this alternative prediction from the attention literature. It clarifies the need to resolve it via data, as they do below.

Discussion:
By my reading, Kowadlo et al. (2019, arXiv) & Antony et al. (2022, bioRxiv) present computational models that also argue for some generalization occurring in the PP (EC-CA3) pathway. The authors should consider how their model's predictions support or differ from these papers.
(Addressing this point is optional.) My broadest question is how these different theta phases, playing different roles in both encoding and retrieval, would play out in the subjective encoding or retrieval of a memory. Do the authors believe a rodent or human can meaningfully alter (or pay attention to) the sparsity of their representations at the rate of a theta rhythm? If one were asked to recall the specific or conceptual associations to a given cue, would a control mechanism (say, from the prefrontal cortex) properly tune the theta rhythm in some way to accomplish this task? I think speculation along these lines would be helpful if the authors have insight, but if it is too speculative, it's fine to omit.

Methods:
Line 527 -"expect" -> "except" Dear Nature Communications Editors and Reviewers: Below we explain how our new manuscript has addressed all the Reviewers' comments and is consequently stronger, clearer, and better connected with previous literature.Changes in the main text have been indicated in red, and all figure and line numberings invoked in this response refer to the revised version of the manuscript.

Reviewer 1
Major point 1: "A paper like this should excite both neuroscientists and AI/ML researchers, because the intersection (while growing) is not yet very large.In particular, the authors' machine learning tasks and secondary analysis of hippocampal datasets are both interesting, but they serve to highlight this gap instead of building a bridge across it.Given its length and disjointedness, I encourage the authors to consider whether this manuscript should be two papers, each of which might better target its respective audience.E.g., (1) tell AI/ML engineers about the algorithmic advances and potential for performance improvements, scalability, generalization, etc.; and (2) tell (computational) neuroscientists about the new theory of hippocampal memory encoding, the threshold-based retrieval function of theta oscillations, and how your secondary data analyses support those theoretical advances.If the authors (understandably) disagree, then I would encourage a revision that better acknowledges this gap and reframes its idea and findings in an interdisciplinary way that enhances relevance and impact across both domains.This could be largely organizational (vs.major rewriting), but it should regardless be carefully considered." We agree with the Reviewer on the importance of a strong connection between our neuroscience and ML components, and we welcome this opportunity to tie the two together more tightly.First, we reorganized the first figure to focus on the overarching organization of the entire work, with a new Fig.1C that schematically illustrates how its components tie together.The original manuscript conveyed how our investigation into CA3 encodings inspired the HalfCorr loss function, but it did not sufficiently explain how HalfCorr networks allow us to probe the function of CA3-like encodings on explicit learning tasks.Thus, we introduced new text in lines 77 and 510 emphasizing the point that training artificial networks enables us to study whether CA3-like encodings are indeed good for complementary computational tasks.With these changes, we feel that our manuscript not only motivates how our neuroscience results contribute to ML developments, but also how our ML results extend the significance of our neuroscience findings.
Major point 2: "The abstract ANN-based models of the MF and PP inputs to CA3 provide analytical tractability and applicability to AI/ML tasks, which are appropriately emphasized in the work and the manuscript.However, the design, parameterization, batch training, and analyses of the ANNs are based on many "magic numbers" sprinkled throughout the manuscript, e.g., in results, captions, methods, and supplementary information.One problem is legibility: without even a table of parameters and values, it is very difficult as a reader to put together a mental model of the systems of ANNs constructed in these studies, which includes subtleties such as how the autoencoder is used in different conditions.A second problem is robustness: the manuscript does not describe how all these values were determined.Did the authors conduct any hyperparameter optimization procedures?How many degrees of freedom do these models have in relation to the dimensionality of the problem domain?How sensitive are the findings (model results or data analyses) to changes in particular parameters?For example, the 0.1 weighting for PP vs. MF inputs in Eq. (2) seems to be important, but is it important only that PP < MF, or must it be close to 0.1?Ascribing parameter values of these abstract binary feedforward or "Hopfield-like" networks to "biological trends" or other vague hand-waving is poor justification.The authors should include more detail to justify their model design and training, e.g., hyperparameter optimization, regularization, cross-validation, sensitivity analyses, parametric dependences, etc.If in fact many of these values were chosen ad hoc or for convenience, then the strength of the paper's claims is undercut and these issues should be clarified in the discussion and elsewhere as appropriate."This is a good point.Our parameter choices were guided by theoretical results for networks that store random MF and PP patterns (Kang & Toyoizumi, arXiv:2302.04481, 2023), and we did not need to perform extensive fine-tuning.To confirm the generality of our results with respect to parameter values, we now present an extensive parametric exploration (Fig. S3F), encompassing PP pattern storage strength, pattern densities, sources of cue noise, and dendritic nonlinearity.Over wide ranges of each parameter, our CA3 model maintains its central capability of recovering both MF examples and PP concepts from both MF and PP cues.This supports the robustness of our network and argues against dependence on "magic numbers."Also, we now summarize our key model parameters in Table 1 in the Methods section.
Minor point 1: "Fig.1E: Autoencoder layer sizes typically decrease toward the middle hidden layer.Why does this network appear to have the inverse design?Is it due to the connectional sparsity?If so, should this still be called an autoencoder?" We wanted our EC network to be large to increase capacity and reduce finite-size effects, especially because our EC patterns are sparse.While it's true that many autoencoders implement a compressed middle layer for dense feature extraction, overcomplete autoencoders with an expanded middle layer are also used for unsupervised feature extraction, especially in the context of sparse coding.In fact, sparse, overcomplete networks are commonly used to model the encoding of natural scenes (Olshausen & Field, 1996;Makhzani & Frey, 2014;etc.).We have clarified these points in line 95.
Minor point 2: "Lines 64-68: These two sentences (starting "For instance...") are odd.First, we should be wary of labeling old ideas "classic", which is akin to an appeal to authority on the basis of age.Second, citing Quiroga's 2005 and 2009 papers as "recent" is quite a stretch, especially given the rate of change in neuroscience over the last 15 years.Lastly, the notion of "grandmother cells" argues against sparse distributed representations, so bringing it up as an example of generalization here appears to be a non sequitur: i.e., you could have sparse statistical or semantic generalization without denying distributed representations.Of course, if the authors in fact intended to deny that hippocampal attractor-based representations like those discussed here are not distributed, then that would be surprising and should be made explicit." We have added explicit references for episodic memory instead of using the term "classic," and we have removed the word "recent."We do not intend to argue against distributed representations in this passage, and we do not feel that the work of Quiroga and colleagues provides strong evidence against the possibility of distributed representations.For example, in Quiroga et al. (2005), only 43 out of 132 responsive units demonstrated specific responses to a single entity.This may be higher than chance level, but both neurons with singular tuning and those with more numerous responses are present.Moreover, "grandmother cells" may also code memories in a distributed fashion, with the possibilities of multiple units tuned to the same entity and singularly tuned units responding to other stimuli not tested-132 entities is still small compared to the enormous range of individuals and objects remembered over a lifetime.While we model memories as fully distributed in our Hopfield-like network, the biological system probably lies somewhere between a oneto-one coding of memories and a completely distributed coding with each neuron's participation determined randomly and independently.
Minor point 3: "Lines 69-81: This paragraph summarizes results from the study, so most of these verbs should be in past tense, not in present or future tenses.E.g., "We will test these predictions..." should be "We tested these predictions...", "each analysis will reveal that..." should be "each analysis revealed that...".This linguistic distinction clarifies for readers what you have specifically done in conducting the research supporting this paper." We appreciate this suggestion and have made changes accordingly.We use the past tense to describe explicit actions, such as constructing models and testing predictions.We use the present tense to describe the consequences of these actions, which are not tied to one moment in time.We hope that these guidelines are sensible to the Reviewer.
Minor point 4: "Ibid.:The last paragraph of the introduction states the main prediction in the paper as the relationship between network sparsity and sharper tuning.If we consider a Gaussian place field and a threshold, then we can imagine that this threshold moves up or down.This mental exercise makes it trivially clear that sparser/denser activation is geometrically coupled to sharper/broader spatial tuning.The authors should clarify throughout the paper how the main prediction or outcomes of their study is not simply an a priori consequence of this trivial geometric relationship."Indeed, the proposed function of the theta oscillation as an activity threshold is central to our model.While its effect on tuning appears obvious, to our knowledge, we do not know of any work that explicitly demonstrates that theta operates in this fashion.In lines 245 and 417, we set our work in a broader context by discussing subtractive and divisive inhibition, both of which have been observed throughout cortical networks.Our theta model argues for a subtractive role on firing rates, but we believe that a priori, the alternative hypothesis of a divisive role is equally valid.In fact, hippocampal pyramidal cells have been observed to experience divisive inhibition on their membrane potentials (Losonczy et al., 2010;Bhatia et al., 2019); downstream effects on firing rates have not been explored.Thus, although some readers may intuitively expect our model prediction, investigation into its consequences for CA3 tuning, especially with respect to the retrieval of different memory encodings, has not been as thoroughly performed before to our knowledge.
Minor point 5: "Line 87: "The sensory inputs that constitute memories in our model..." This must be clarified.Sensory inputs do not constitute memories in models of memory; these are very different things." We have clarified that the encodings are stored as memories, not the sensory inputs themselves, by changing this phrase to "The sensory inputs whose encodings serve as memories in our model are FashionMNIST images...." Minor point 6: "Line 92: "...random, binary, and sparse..." Is this supposed to be respective to the DG, MF, and PP encodings?This is a good example of a sentence that raises more questions than information it provides.E.g., aren't all of these projections binary?If all projections are random, binary, and sparse, then how and why do the parameters or distributions of randomness and sparsity vary across the projections?" We have now made explicit the connections that are modeled by these types of matrices (line 101).Yes, the EC --> DG, DG --> MF, and EC --> PP projections are all binary.We have added a phrase at the end of the Introduction section to alert the reader that parameter descriptions, justifications, and values are found in the Methods section and Table 1 (line 82).Across projections, synapses are always randomly chosen such that each postsynaptic neuron receives a fixed number of excitatory synapses from presynaptic neurons.These fixed numbers, or projection sparsities, are chosen to follow biological trends; due to uncertainties in experimental measurements and in the proper way to rescale biological counts for model networks with much fewer neurons, precise determination of these sparsities is not possible.We have also explicitly mentioned in the Methods section that based on our theory, we do not expect projection sparsity to be a crucial parameter (line 566).We hope that these changes improve the understanding of our model.Minor point 7: "Line 98: (Addressing the authors here.)I understand that you want your variable 'sparsity' to map naturally to [0,1], but to define 'sparsity' as the inverse of the actual meaning of the English word "sparsity" is very confusing.E.g., on Line 106, you equate "decreasing sparsity" with "sparsification", which would be defined in English as "increasing sparsity"!Then, on Line 110, you introduce 'sparsity' as a_pre, which seems to reflect an "activity" level (as in, 'a' for "activity").I would strongly encourage the authors to consider refactoring this variable name from 'sparsity' to 'active fraction' or something similar.Then, "decreasing active fraction" clearly means "less activity" and thus "sparsification", with no need for mind-bending mental translation on the part of your readers." We appreciate the Reviewer's suggestion, which is echoed in Reviewer 3's minor point 3, and we use Reviewer 3's term "density" to describe the variable a.
Minor point 8: "Line 118-124: To solve the problem of translating CA3 patterns back to images, the authors train a continuous-valued feedforward EC-*CA3 network to feed CA3 output to the decoder half of the autoencoder.This seems like a very complicated way to include a CA1 network.The unexplained complexity here raises a number of questions.Why is this image-generating network continuous-valued, but the main EC-*CA3 networks are binary?Why not add a CA1 module and train the whole trisynaptic loop?Is this effectively the same?Is this network only for research visualization purposes?Wouldn't the brain need to convert CA3 output into reconstructed memories anyway?So why not study this additional network in the same context?Does the decodability of the two encodings rely on an assumption of disentangled representations, e.g., via linear superposition?" Yes, the decoding network is for visualization only.We clarify this point in line 123 and with a new label in Fig. 2B.Indeed, the brain would need to convert CA3 outputs back into reconstructed memories.However, we find that modelling this reconstruction process is beyond the scope of our manuscript.To model outputs through CA1, one would need to also consider the temporoammonic projection from EC back to CA1, as well as additional projections from CA1 to subiculum, from subiculum to EC, and from EC to subiculum.Thus, carefully making claims about CA1 would require a much more complex model than the one we have constructed to study CA3.Since we do not intend to model CA1, we use continuous-valued neurons.We expect that using binary neurons would not significantly affect the decoding performance of the network, especially with an increased number of hidden layer neurons, as supported by many studies on quantization in neural networks (Hubara et al., Adv. NIPS, 2016;etc.).
Minor point 9: "Line 127: The assumption of linear dendritic integration between proximal and distal sites is simply false (cf.Mel, Poirazi, et alia).If the relative weighting of 0.1 PP vs. 0.9 MF is based on this assumption, then the resultant findings of threshold-based effects that depend on that weighting and its linearity may not hold up to scrutiny.The authors should provide additional context for understanding this assumption and its follow-on implications in the results and discussion sections." We now discuss nonlinear dendritic integration more extensively (line 134) and perform simulations that implement it (Fig. S3F).With binary dendritic inputs, only one type of nonlinearity is possible: given the strength of, say, 0.1 for an active PP input and 0.9 for an active MF input, what is the strength when both are active?It is either 1 (linear), less than 1 (sublinear), or greater than 1 (superlinear).With both sublinear and superlinear values, our network can still function (Fig. S3F), demonstrating that our assumption of linear integration is not crucial to our results.
Minor point 10: "Line 157: It's unclear whether the authors (or others in the field) might have expected a "concept" representation to reflect a simple average of a set of exemplars.This seems like a surprising expectation to me and potentially underlies the primary distinction this paper is making between "examples" and "concepts".It is not difficult to think of many reasons why an average example may not serve the required abstractions and other functions of a concept.This relationship between examples and concepts should be clarified." This concern is also shared in Reviewer 2's major point 1.We have now provided additional information at lines 170 and 178, as well as a new Fig.S2A.This figure shows that examples from each concept form well-separated clusters in image space.Thus, mean images, which lie at their centroids, can serve as concept representations.We have also discussed that with more complex datasets, the mean image approach may not work, and we would suggest averaging features of a trained classifier or a variational autoencoder (line 181).
Minor point 11: "Fig.3C: Pie charts are difficult to parse for many people.Please consider using bar or stacked charts instead." We have made this change.
Minor point 12: "Fig.4D: This shows the trivial geometric relationship between sparsity and tuning that I described above.
The "alternative prediction" in the right-hand panel does not convey equivalent empirical content as the "model prediction" on the left, and is thus much easier to dismiss.Treating these two possibilities as equally likely is equivalent to overclaiming the statistical power of your inference.If this is not the case, then the authors should clarify why the left panel does not reflect a trivial relationship and the right reflects an equally likely possibility given what we (in the field) already know about how place cells fire as a rodent traverses a place field." Please see our response to minor point 4.
Minor point 13: "Line 273: Use of 4 spatial bins is confounded with place-field size, so the relative offset of the bins should have also been considered to avoid field-edge effects.The authors should clarify and justify how they determined bin numbers, sizes, and offsets to compute spatial information of place-cell firing." For Fig. 6, we intentionally did not avoid splitting up fields with our partitions because the opposite scenario was adopted in Fig. 5, where intact place fields were extracted.Thus, we included all offsets that met minimum spiking requirements, as described in Methods.We added text to line 302 that clarifies our motivations.As for the bin number, distinguishing among 4 quadrants of the track seemed to reasonably capture the broadest position representation that has behavioral relevance.Note that bin number must be preserved across scales to avoid bias (Fig. S6F-H).At small scales, distinguishing between fewer bins, say only 2 adjacent bins of scale 1/16, seemed to be a too narrow information metric.Meanwhile, estimating information with more bins requires more spikes and would decrease the number of valid neurons.Balancing these two factors led us to choose 4 bins.
Minor point 14: "Line 391: "experimental findings" → "secondary analysis of experimental data"" We have made this substitution.
Minor point 15: "Line 410: In discussing Skaggs (1996), the authors compare hypothetical results after mentioning differences in binning technique and offsets.Instead of hand-waving and speculating, why not go ahead and replicate the Skaggs binning technique so you can show us whether or how much it matters?Additionally, the authors neglected to discuss another highly relevant and more recent study that likewise conducted a secondary analysis of the hc-3 dataset: Souza & Tort.Asymmetry of the temporal code for space by hippocampal place cells.Scientific Reports.2017;7(1):8507." The Skaggs et al. (1996) binning technique is designed for 2D environments, whereas the data we analyzed were all taken in 1D environments.Of course, it is possible to translate this technique to linear tracks, but we do not feel that the modified comparison would provide sufficient benefit.We appreciate the Souza & Tort (2017) reference and now discuss it briefly in line 446.
Minor point 16: "Line 480: N=256 images seems like a very small trainset.The authors should provide additional context and justification for low-sample training, including their restriction to a small number of categories in the FashionMNIST dataset.Substantial downselection of the trainset could impact the generalizability and robustness of model inference based on these ANNs." We store a relatively small amount of examples (up to 100 in a network) within a relatively small number of concepts (3) because our CA3 network is relatively small with only 2048 neurons.Moreover, the crucial transition from recovering PP examples to recovering PP concepts can be observed within this explored range.With larger networks, more examples and concepts can be used.In Fig. 3H, I; Fig. S3G, H; and Fig. S4, we use larger networks that store random MF and PP patterns to demonstrate that our network behaviors can generalize.We hope that our new emphasis of this point in line 204 argues for the generalizability of our results.
Minor point 17: "Line 524: The authors state that cues are formed by flipping only 1% of all neurons in a given target pattern.
From this, should I infer that the primary pattern completion inference problem in this the study is to correct a cue that is already a 99% match to a pre-trained target pattern?Is the network being initialized to a state that is 99% of the way to a correct answer?If this is the case, then I fail to see how the problem-solving performance of these network models can be assessed given the data presented." We originally chose to flip only 1% of all neurons because MF patterns have a density of 2%, so this process adds noise whose quantity is 50% of the original pattern, which can be considered the signal.However, it is true that 1% flipping is much less significant for PP patterns with a density of 20%.Moreover, for both MF and PP patterns, only 1% the active neurons in the target pattern are inactivated, so the signal largely remains intact.To address this good point, we perform simulations in which more neurons are flipped; recovery of both MF examples and PP concepts from both MF and PP cues can be largely maintained up to 10% of random flipping of all neurons, active and inactive (cue inaccuracy in Fig. S3F).Instead of random flipping, we also consider randomly inactivating a fraction of the active neurons to form cues (cue incompleteness in Fig. S3F).Up to 40% inactivation of the active neurons can be tolerated without a considerable decline in performance.We have also added accompanying text in lines 190 and 586.
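To make the two cue-corruption procedures concrete, here is a minimal numpy sketch (function names are ours, not the manuscript's):

```python
import numpy as np

rng = np.random.default_rng(0)

def inaccurate_cue(pattern: np.ndarray, frac: float) -> np.ndarray:
    """Cue inaccuracy: flip a random fraction of ALL neurons, active
    and inactive (tolerated up to ~10% in Fig. S3F)."""
    cue = pattern.copy()
    flip = rng.choice(cue.size, size=int(frac * cue.size), replace=False)
    cue[flip] = 1 - cue[flip]
    return cue

def incomplete_cue(pattern: np.ndarray, frac: float) -> np.ndarray:
    """Cue incompleteness: silence a random fraction of the ACTIVE
    neurons (tolerated up to ~40% in Fig. S3F)."""
    cue = pattern.copy()
    active = np.flatnonzero(cue)
    off = rng.choice(active, size=int(frac * active.size), replace=False)
    cue[off] = 0
    return cue
```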

Reviewer 2
Major point 1: "The retrieval of PP patterns appears similar to the average image in one category, and the authors treated this as retrieving a concept of that category.Can the authors provide more evidence about why this mean image can represent the concept of a 'concept'?Is this mean image an analysis artifact due to the simple dataset where images belonging to a single category do not have large variance, so the averaged image can still be recognized as sneakers, trousers, or coats.In other words, the authors could provide more evidence for why the mean image can represent the concept of a category by discussing previous studies that have used similar methods and how they have validated their approach.They could also address the concern about the simple dataset used and discuss whether the mean image approach would still hold for more complex datasets (such as Cifar10)." We have clarified this point now with additional information at lines 170 and 178, as well as a new Fig.S2A.This figure confirms the Reviewer's intuition that examples from each concept form well-separated clusters in image space.Thus, mean images, which lie at their centroids, can serve as concept representations.With more complex datasets, the mean image approach may not work, and we would suggest averaging features of a trained classifier (line 181).Alternatively, more elaborate unsupervised dimensionality reduction techniques such as variational autoencoders may extract features appropriate for concept formation.Here we used a simple encoder and decoder to study recall properties of an associative memory network.
Major point 2: "The threshold \theta in the sigmoid function is controlled by the width of the function, but how is the changing of the width being linked to the hippocampal theta oscillation?More importantly, the oscillating threshold does not necessarily represent the oscillation rhythm at the theta band.The authors might want to look at the Pfeiffer and Foster (2015) paper where they found the slow gamma oscillations route the information flow during offline states, i.e., the decoded trajectory from awake SWRs." We regret to have misinformed the Reviewer; the threshold \theta is the position of the center of the sigmoid function.A higher threshold requires each neuron to receive more recurrent excitatory input in order to fire; thus, it models a general inhibitory tone applied to all principal cells.As the Reviewer may already appreciate, the theta oscillation is believed to serve as this inhibitory tone, since medial septum inputs that drive the oscillation modulate the inhibitory interneurons in CA3.To prevent this confusion, we have removed the label "threshold softness" from Fig. 3C.
The width of the sigmoid function is the temperature of the Glauber dynamics (Eq. 11), which contributes a degree of randomness during pattern retrieval. It models noise in biological neurons, which may not always fire when their membrane potential exceeds a fixed threshold value by a small amount. This temperature is fixed for all our simulations.
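In other words, writing the update rule in a plausible generic form (our paraphrase, since Eq. 11 is not reproduced in this letter): a selected neuron i with recurrent input h_i = \sum_{j \neq i} W_{ij} x_j becomes active with probability

P(x_i \to 1) = 1 / (1 + \exp[-(h_i - \theta)/T]),

where \theta, the center of the sigmoid, is the oscillating inhibitory threshold, and T, its width, is the fixed temperature modeling neuronal noise.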
The possibility that a different oscillation can serve as the varying threshold in our model during offline states is very interesting.We thank the Reviewer for this reference, and we now discuss this possibility in line 450.
Minor point 1: "Pattern separation through the MF pathway might take time.In this case, will the MF and PP encodings arrive at CA3 at the same time?If one arrives earlier than the other, will it still be a simple addition in the mathematical model?" We consider this point in line 468.With the symmetric Hebbian STDP in our model, small timing differences would not be consequential, but with an asymmetric temporal component, PP-to-MF connections could be stronger than MF-to-PP connections, which would presumably favor the heteroassociation from PP concepts to MF examples.However, we note that if each memory's encodings are presented tonically to CA3-over the course of a second, for example-then the short temporal difference of tens of milliseconds due to transmission through DG would not have a significant impact.
Minor point 2: "The authors could explain why they chose to use the autoencoder to represent the neural representation of EC neurons and provide some evidence for why this approach is valid." In line 95, we now explain that we follow previous works that modeled the neural encoding of natural scenes with sparse, overcomplete coding models and autoencoders (Olshausen & Field, 1996;Makhzani & Frey, 2014;etc.).While our images are not natural scenes per se and EC lies many layers away, so to speak, from V1 in the visual pathway, a sparse, overcomplete autoencoder provides many desired features for our model: tunable sparsity, a large number of EC neurons to reduce finite-size effects in our simulations, and the learning of memories in an unsupervised way.It is true that other architectures, such as deep convolutional neural networks for image classification (Yamins & DiCarlo, 2016;etc.)have also been invoked to study visual pathways.However, we felt that this supervised approach would conflict with the unsupervised building of concepts that we wanted to demonstrate in our model.
Minor point 3: "To my understanding, the synaptic connections in the Hopfield-like network (Eq.2) will give attractor states as the superimposed patterns, e.g., 0.9x_i"{MF}+0.1x_i"{PP}.How can the network retrieve either x_i"{MF} or x_i"{PP}?"Indeed our network attempts to retrieve something like the linear combination 0.9 x'MF + 0.1 x'PP; however, it consists of binary neurons, so values between 0 and 1 are not accessible.At high threshold, neural activity is disfavored, so only the most strongly stored component would be recovered: x'MF.As the threshold is lowered, more neurons are encouraged to activate, so neurons participating in the less strongly stored component would also activate: x'PP.Thus, the recovered pattern would encompass active neurons in either x'MF or x'PP, all of which take the same activity value 1.Since x'MF is much sparser than x'PP, this combined pattern is very similar to x'PP, so x'PP is approximately recovered at low threshold.We have additional explanation to line 158 which we hope makes this process clearer.
Minor point 4: "It will be clearer if simply do a histogram plot of the spike numbers against LFP theta phase to show the theta modulation of place cell activity (line 211 and Fig. 4B, making the histogram on the right side of panel B bigger)." We take this good suggestion and plot histograms for 5 cells in Fig. 5C.
Minor point 5: "Is the Wilcoxon signed-rank test in Fig. 4K two-tailed or one-tailed?" The test is two-tailed, as is now specified for all Wilcoxon signed-rank tests and Mann-Whitney U tests.

Reviewer 3
Major point 1: "Intro/Discussion: Theta power changes during various tasks [e.g., increases at successful encoding (Jacobs et al., 2006;numerous Michael Kahana papers) or exploration in rodents (countless papers)].Does theta power mean more sparsity overall or less?Or just a wider range of sparsity levels across the cycle?Addressing this point could link these findings to vast literatures that largely report theta power values, and it could generate new hypotheses." We thank the Reviewer for pointing us to this line of research, with which we now engage in our manuscript.As the amplitude of an oscillation, we believe that theta power would be most likely related to the range of sparsity levels, which determines the ease with which both example and concept encodings can be accessed.We perform new simulations to touch upon this topic; to study theta power during memory retrieval, we change the amplitude of our threshold oscillation (Fig 4A , C).We find that with reduced oscillation amplitude, the network lingers on sparse example encodings instead of alternating with dense concept encodings.And by also changing the oscillation midpoint, it can linger on a dense concept (not shown).Ultimately, a more physiologically accurate model would be required to elucidate the role of theta power, whose modulation is strongly connected to changes in firing rates.We use neurons with only two states in our model.Our new simulation results are complemented by an extended discussion (line 474) that not only covers the points mentioned in this response, but also addresses possible roles of theta power during memory encoding, which was also mentioned by the Reviewer.
Major point 2: "Figure 1 -Generally, D-H could use better organization / illustration.For instance, where does the right half of E fit in with respect to the architecture in D? Can you visually label where in D that F fits in (along EC-CA3?)?
Or is F more conceptual than the specific hippocampal model, of which EC-CA3 is only one case (as I believe they are indicating in G)?" To make the diagrams clearer, we have first separated the former Fig. 1 into Figs. 1 and 2, such that Fig. 2 can focus on the model.Furthermore, we have modified Fig. 2B such that it contains all network pathways, encoding and decoding.We then use colors and dashing to unambiguously label each pathway in both Fig. 2B and Fig. 2C-F.We have also inserted text in line 101 that hopefully improves understanding for Fig. 2D, E. Indeed, Fig. 2D, E refers to the projections from EC to DG, from DG to MF, and from EC to PP.In Fig. 2E, we also consider randomly generated x'pre instead of using x'EC to more comprehensively understand these projections on a conceptual level.
Minor point 1: "Abstract -first sentence: Using 'sensory' makes it seem as if the hippocampus does not store 'conceptual' information along with a memory trace." We have replaced "sensory information" with "experiences." Minor point 2: ""Meanwhile, recent research has found that the hippocampus also participates in semantic memory, as evidenced by grandmother cells that generalize over your visits and respond to many different representations of your grandmother (Quiroga et al., 2005(Quiroga et al., , 2009))."Norman et al. (2021, Science) is also relevant to this point."
